feat(benchmark-harness): TF1/SF1 measurement infra for release validation (#320) #361
Open
yfedoseev wants to merge 11 commits into
Adds `tools/benchmark-harness/` as a workspace crate. This is verification infrastructure, not a feature: without ground-truth scoring, "did this release improve extraction quality?" has no answer beyond gut feel and byte diffs.

Phase 1–2 in place:
- `tools/benchmark-harness/PLAN.md`: scoring formulas, 8-phase sequencing, risk register. Mirrors Kreuzberg's methodology so numbers are comparable across projects (#320's ask).
- `benchmark-harness run --engine pdf_oxide --corpus DIR --ground-truth DIR --output JSON`: extracts each PDF with the pdf_oxide in-process adapter, scores TF1 (bag-of-words F1 on lowercase alphanumeric tokens) against a matching .md file, and emits a JSON report with per-fixture + aggregate (mean, p50, lower-tail p90) metrics.
- `benchmark-harness diff BASE.json HEAD.json`: prints per-fixture regressions and exits non-zero when mean TF1 drops >0.5pp or any fixture drops >5pp. Thresholds are tunable flags.
- 5 unit tests on the tokenizer / F1 scorer (identical, disjoint, empty, partial, lowercase+punct stripping).

Later phases (SF1 block parser, pdftotext/pdfium adapters, consensus ground-truth fallback, vendored Kreuzberg fixtures, Makefile target) are tracked in PLAN.md and stubbed so the trait boundaries don't need to change later.
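A minimal sketch of the TF1 score described above (bag-of-words F1 on lowercase alphanumeric tokens). This is not the harness's `src/score.rs`; the function names `tokenize`, `bag`, and `tf1` are hypothetical, and the handling of two empty inputs (scored as 1.0 here) is an assumption.

```rust
use std::collections::HashMap;

/// Lowercase the text and split on every non-alphanumeric character.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Count how often each token occurs (the "bag" in bag-of-words).
fn bag(tokens: &[String]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for t in tokens {
        *counts.entry(t.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Bag-of-words F1 between extracted text and reference text.
fn tf1(extracted: &str, reference: &str) -> f64 {
    let ext = tokenize(extracted);
    let gt = tokenize(reference);
    if ext.is_empty() && gt.is_empty() {
        return 1.0; // assumption: two empty documents agree perfectly
    }
    if ext.is_empty() || gt.is_empty() {
        return 0.0;
    }
    let ext_bag = bag(&ext);
    let gt_bag = bag(&gt);
    // Overlap counts each token at most as often as it appears on both sides.
    let overlap: usize = ext_bag
        .iter()
        .map(|(tok, &n)| n.min(*gt_bag.get(tok).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / ext.len() as f64;
    let recall = overlap as f64 / gt.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

With this sketch, `tf1("Hello, World!", "hello world")` scores 1.0 because case and punctuation are stripped before comparison, while extra tokens on the extracted side lower precision only.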
Adds `tools/benchmark-harness/src/sf1.rs`: a block-weighted F1 implementation matching Kreuzberg's methodology, so SF1 numbers we publish are directly comparable.

Scoring pipeline:
- Parse markdown via pulldown-cmark (tables, math, GFM) into typed blocks: Heading(1..6), Paragraph, CodeBlock, Formula, Table, ListItem, Image. Math in a paragraph promotes it to Formula, so engines that emit `$\alpha$` inline still score as a formula block.
- Per-block weights: heading=2.0, code/formula/table=1.5, list=1.0, paragraph/image=0.5. Heading detection is the highest-signal layout decision; the weights reflect that.
- Type-compat matrix for cross-type allowances: heading↔heading by level distance (clamped ≥0.6), list↔paragraph=0.5, paragraph↔heading=0.25, code↔formula=0.3, code↔paragraph=0.2, table↔paragraph=0.25.
- Greedy matching on (content_tf1 × type_compat) with threshold 0.10 (0.20 for short blocks <5 tokens) and no-replacement assignment by descending score.
- Weighted precision/recall/F1 using the matched weights on both sides.
- Order score = LIS length of matched ext indices (sorted by gt index) / match count. 1.0 = perfectly preserved order; 0.5 = half the matches are out of place.

The per-fixture report gains sf1, sf1_precision, sf1_recall, order_score, matched_blocks. Aggregate gains sf1_mean/p50/p90 and order_mean. `diff` prints mean TF1, SF1, order deltas; gate thresholds are still TF1-only for now (SF1 gating needs calibration on a real corpus first to avoid false positives from parser differences).

10 new unit tests cover block parsing (headings/paragraphs/code/tables), identical-input SF1=1, disjoint content SF1≈0, heading-level-mismatch partial compat, reversed-order order_score=0.5, LIS basics, weight taxonomy, and h1↔h2 / h1↔h6 compat values.
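The order score above can be sketched as follows. This is an illustrative version under stated assumptions, not the code in `src/sf1.rs`: the names `lis_len` and `order_score` are hypothetical, matches are represented as `(gt_index, ext_index)` pairs, the subsequence is taken as strictly increasing, and an empty match list is scored as 1.0.

```rust
/// Length of the longest strictly increasing subsequence,
/// computed with the patience-sorting technique in O(n log n).
fn lis_len(seq: &[usize]) -> usize {
    // tails[k] holds the smallest possible tail of an increasing run of length k + 1.
    let mut tails: Vec<usize> = Vec::new();
    for &x in seq {
        // First position whose tail is not smaller than x.
        let pos = tails.partition_point(|&t| t < x);
        if pos == tails.len() {
            tails.push(x);
        } else {
            tails[pos] = x;
        }
    }
    tails.len()
}

/// Order score = LIS length of matched extracted indices
/// (after sorting matches by ground-truth index) / match count.
fn order_score(matches: &[(usize, usize)]) -> f64 {
    if matches.is_empty() {
        return 1.0; // assumption: nothing matched means nothing is out of order
    }
    let mut sorted = matches.to_vec();
    sorted.sort_by_key(|&(gt_idx, _)| gt_idx);
    let ext: Vec<usize> = sorted.iter().map(|&(_, ext_idx)| ext_idx).collect();
    lis_len(&ext) as f64 / ext.len() as f64
}
```

Two matched blocks emitted in swapped order give an LIS of 1 over 2 matches, which is the 0.5 case named in the unit-test list.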
phases 4–8)

Finishes the benchmark harness. Phases 4–8 in one commit.

Engine adapters (phase 4)
- `pdftotext` subprocess adapter wrapping poppler's `pdftotext -layout`. Probes the binary once at startup so a missing install fails fast, not per fixture. Honours `PDFTOTEXT_BIN` for non-standard locations.
- `pdfium` adapter behind the `pdfium` feature (default off, since the crate needs a prebuilt native library). Uses `pdfium-render` and falls back between system library and `PDFIUM_DYNAMIC_LIB_PATH`.

Consensus-baseline ground truth (phase 5)
- `--consensus-peers pdftotext,pdfium` on `run` (mutually exclusive with `--ground-truth`). Per PDF, runs the peers, takes the token intersection of ≥N (default 2) peers, and scores the target engine against it. SF1 is skipped in consensus mode (needs block stream, not a token set) so numbers aren't misleading.
- Report gains a `reference` field: `"manual"` vs `"consensus(pdftotext,pdfium)"`. Prevents downstream readers from confusing inter-engine agreement with absolute quality.
- 3 unit tests on the consensus token set + scoring (min-agree, peers exceed threshold, partial overlap).

Fixtures (phase 6)
- `scripts/fetch-fixtures.sh`: clones Kreuzberg (pinned via `KREUZBERG_REF`, default `main`) into `.fixture-src/`, symlinks `tools/benchmark-harness/fixtures/kreuzberg → tools/benchmark-harness/fixtures` from the upstream. Re-runnable; idempotent. Don't vendor PDFs directly: per-fixture licenses inside Kreuzberg's corpus vary.

Makefile + README (phase 8)
- `make benchmark-fetch`: runs the fetch script
- `make benchmark-run`: `cargo run --release -p benchmark-harness -- run --engine $(ENGINE) …`
- `make benchmark-compare`: diff with regression gate
- README documents scoring formulas, invocation, engine matrix, JSON report schema, and license posture.

Tests: 18 total (5 TF1 + 10 SF1 + 3 consensus). Clippy clean under `-D warnings`.
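The consensus reference described under phase 5 can be illustrated roughly as below: keep every token that at least N peers produced. The names `token_set` and `consensus_tokens` are hypothetical, and treating each peer's output as a set (rather than a bag with counts) is an assumption based on the "token set" wording.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: build a peer's token set from literal words.
fn token_set(words: &[&str]) -> HashSet<String> {
    words.iter().map(|w| w.to_string()).collect()
}

/// Tokens that appear in the output of at least `min_agree` peers.
/// With two peers and `min_agree = 2` this is the plain intersection.
fn consensus_tokens(peer_tokens: &[HashSet<String>], min_agree: usize) -> HashSet<String> {
    // One vote per peer per token, since each peer contributes a set.
    let mut votes: HashMap<&str, usize> = HashMap::new();
    for peer in peer_tokens {
        for tok in peer {
            *votes.entry(tok.as_str()).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .filter(|&(_, n)| n >= min_agree)
        .map(|(tok, _)| tok.to_string())
        .collect()
}
```

The target engine's tokens would then be scored against this set instead of a manual reference, which is why the report labels the result as consensus rather than absolute quality.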
Release branch build path unaffected: the crate is a new workspace member behind a cfg-less `cargo run -p benchmark-harness`.

Release-validation workflow this enables:

    git checkout main && make benchmark-run OUTPUT=base.json
    git checkout feat/X && make benchmark-run OUTPUT=head.json
    make benchmark-compare BASE=base.json HEAD=head.json

→ non-zero exit on meaningful TF1 regression, tuneable thresholds.
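A rough sketch of the regression gate behind `make benchmark-compare`, using the default thresholds stated earlier (mean TF1 drop > 0.5pp, per-fixture drop > 5pp). The `Gate` struct, the `report` helper, and `gate_violations` are hypothetical names; reports are reduced to a fixture-name to TF1 map, and fixtures missing from HEAD are skipped, which is an assumption about behavior the source does not specify.

```rust
use std::collections::BTreeMap;

/// Gate thresholds in percentage points (pp).
struct Gate {
    max_mean_drop_pp: f64,
    max_fixture_drop_pp: f64,
}

impl Default for Gate {
    fn default() -> Self {
        Gate { max_mean_drop_pp: 0.5, max_fixture_drop_pp: 5.0 }
    }
}

/// Hypothetical helper: build a fixture -> TF1 map from literal pairs.
fn report(pairs: &[(&str, f64)]) -> BTreeMap<String, f64> {
    pairs.iter().map(|&(name, tf1)| (name.to_string(), tf1)).collect()
}

/// Returns one message per violated threshold; an empty list means the gate passes.
fn gate_violations(
    base: &BTreeMap<String, f64>,
    head: &BTreeMap<String, f64>,
    gate: &Gate,
) -> Vec<String> {
    let mut violations = Vec::new();
    let mut base_sum = 0.0_f64;
    let mut head_sum = 0.0_f64;
    let mut shared = 0usize;
    for (name, &b) in base {
        // Assumption: fixtures absent from HEAD are skipped, not failed.
        let Some(&h) = head.get(name) else { continue };
        base_sum += b;
        head_sum += h;
        shared += 1;
        let drop_pp = (b - h) * 100.0;
        if drop_pp > gate.max_fixture_drop_pp {
            violations.push(format!("{name}: TF1 dropped {drop_pp:.2}pp"));
        }
    }
    if shared > 0 {
        let mean_drop_pp = (base_sum - head_sum) / shared as f64 * 100.0;
        if mean_drop_pp > gate.max_mean_drop_pp {
            violations.push(format!("mean TF1 dropped {mean_drop_pp:.2}pp"));
        }
    }
    violations
}
```

A command-line wrapper could print the messages and call `std::process::exit(1)` when the list is non-empty, which gives the non-zero exit the workflow relies on.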
Two bugs found by the first local run on the Kreuzberg corpus:
- Fetch script pointed DEST at the upstream's fixture *metadata*
directory, but the PDFs and ground-truth markdown actually live
under test_documents/{pdf,ground_truth/pdf}. Flatten both into
${DEST}/pdfs and ${DEST}/gt as symlinks so the harness's
stem-matching loader just works.
- walkdir does not follow symlinks by default, so every stem-matched
pair was invisible. Enable follow_links(true) on both walkers.
- Makefile CORPUS/GROUND_TRUTH point at the flattened subdirs.
- Add .gitignore for the upstream clone + generated symlink forest so
re-running the fetch script never contaminates the working tree.
First numbers on the 102-pair intersection (TF1 mean):
pdf_oxide : 0.919 pdftotext : 0.946 Δ: -2.7pp
Detailed analysis follows in a separate artefact.
Running the harness end-to-end on Kreuzberg's 102-pair PDF corpus turned up real pdf_oxide bugs, which is the whole point. Captured the findings in BASELINE_ISSUES.md.

Headline numbers (engine vs pdftotext, TF1):
  mean 0.919 / 0.946 (Δ -2.7pp)
  p50  0.965 / 0.984 (Δ -1.9pp)
  p10  0.776 / 0.881 (Δ -10.5pp) ← biggest gap on hard fixtures

Four issues identified, ranked by blast radius:
- B1: extract_text(n) returns identical content per page on some linearized PDFs (nougat_005.pdf: TF1 0.254 vs pdftotext 0.924). Page index appears to resolve to page 0 for every call.
- B2: empty-page false positives on text-heavy pages (pdfa_010 pages 2/9/11 return 0 bytes; pdftotext emits 400–2000 each).
- B3: running-artifact detector suppresses cover-page titles when they happen to overlap with per-page running headers (pdfa_010 loses "University of Oklahoma 2009"; same class as the 5PFVA6 case from the v0.3.31 sweep).
- B4: XY-cut reading-order loses content on multi-column / dashboard layouts (order_mean 0.80 vs 0.86, nougat_026, pdfa_001, etc.).

All four are existing pdf_oxide bugs that the 170-PDF byte diff couldn't catch (bytes matched across branches because both carry the bug). Now we have a verification pipeline with numbers.
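For readers unfamiliar with the headline aggregates, a minimal sketch of mean, p50 and p10 over per-fixture scores follows. The percentile method the harness actually uses is not stated in the source; nearest-rank is an assumption here, and the names `mean` and `percentile` are hypothetical.

```rust
/// Arithmetic mean of per-fixture scores; `None` for an empty corpus.
fn mean(scores: &[f64]) -> Option<f64> {
    if scores.is_empty() {
        return None;
    }
    Some(scores.iter().sum::<f64>() / scores.len() as f64)
}

/// Nearest-rank percentile (assumed method), with `p` given in whole percent.
/// p = 10 returns a low-tail score, so hard fixtures dominate it.
fn percentile(scores: &[f64], p: usize) -> Option<f64> {
    if scores.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(p / 100 * n), computed in integers to avoid rounding drift.
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}
```

Because p10 sits in the low tail, a fix to a handful of badly scoring fixtures can move it by several points while the mean barely changes.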
Numbers on the Kreuzberg 102-fixture corpus with the B1 fix merged in:
  TF1 mean 0.919 → 0.925 (+0.64pp)
  TF1 p10  0.776 → 0.848 (+7.2pp) ← hard-tail improvement
  SF1 mean 0.337 → 0.339 (+0.22pp)
  runtime  8.3 s → 5.7 s (−31%)

Zero per-fixture regressions. The worst-in-corpus fixture nougat_005 moved from TF1 0.254 to 0.901, now essentially at parity with pdftotext's 0.924 on that file.

This validates the harness workflow end-to-end: harness found a bug, fix landed with TDD coverage, rerun quantifies the improvement, diff subcommand gates against any accidental regression.

Drop tools/.gitignore that came in from the fix branch: on the benchmark-harness branch the tools/benchmark-harness/ crate is the whole point and must stay tracked.
After merging B1 and B3 into the harness branch, the Kreuzberg 102-fixture benchmark shows:
  TF1 mean 0.919 → 0.927 (+0.77pp)
  TF1 p10  0.776 → 0.849 (+7.3pp) ← hard tail
  SF1 mean 0.337 → 0.343 (+0.54pp)
  order    0.804 → 0.819 (+1.5pp)
  runtime  8.3s → 5.6s (-33%)

Zero per-fixture regressions at either fix. Supersedes B1_RESULTS.md.

B2 closed as not-a-bug: post-B1 no fixture has pdf_oxide returning empty where pdftotext succeeds; pdfa_010's empty pages turned out to be genuinely empty in both tools.

B4 deferred: multi-column reading-order wants XY-cut promoted to default in extract_text, which is an architectural change with enough blast radius to warrant its own validation cycle. Tracked; nougat_026/pdfa_001 at order_score ~0.4 are the canaries for it.
XY-cut as default reading order for multi-column pages is correct (synthetic TDD test passes) but the Kreuzberg corpus aggregate shows neutral impact:
  TF1 mean 0.927 → 0.927 (+0.04pp)
  SF1 mean 0.343 → 0.342 (−0.09pp)
  order    0.819 → 0.817 (−0.19pp)

Per-fixture: ~6 wins (nougat_011/012, pdfa_048) at +5..+10pp, ~5 losses (nougat_033, pdfa_008, pdfa_037) at −2..−14pp, and a long tail of no-ops.

Interpretation captured in RESULTS.md: XY-cut is semantically right, but Kreuzberg's ground-truth markdown was generated from content-stream-order serialisers, so on single-column pages where content-stream ≈ row-aware, our fix loses SF1 points against a GT that's "less correct in the same way". This is exactly the kind of corpus-bias artefact the harness exists to surface: no amount of heuristic tightening will improve the aggregate without disabling the wins.

No per-fixture TF1 regression > 0.5pp; diff gate passes. Keeping the fix since the synthetic test proves correctness on clearly-multi-column input; the real corpus-level improvement needs better GT.
This was referenced Apr 15, 2026
Summary
Closes #320. Release-verification infrastructure as a workspace crate at `tools/benchmark-harness/`. Computes TF1 (token F1) and SF1 (block-weighted structural F1 with LIS ordering) against ground-truth markdown, so "did this release improve extraction quality?" has an answer beyond gut feel and byte diffs.

The methodology mirrors Kreuzberg's benchmark-harness so numbers are comparable to their published reports.
What's in
- `benchmark-harness run --engine <E> --corpus DIR --ground-truth DIR --output JSON` and `benchmark-harness diff BASE.json HEAD.json` with a configurable regression gate (default: fail on mean TF1 drop > 0.5pp or per-fixture drop > 5pp).
- `pdf_oxide` (in-process), `pdftotext` (subprocess), `pdfium` (behind `--features pdfium` since the crate needs a prebuilt native lib). Adapter trait in `src/engine.rs`: one enum arm + one impl per new engine.
- TF1 tokenizer and scorer (`src/score.rs`); SF1 pulldown-cmark block parser + type-compat matrix + greedy match + weighted P/R/F1 + LIS order penalty (`src/sf1.rs`). All formulas documented in `PLAN.md` and `README.md`.
- `--consensus-peers pdftotext,pdfium` uses peer agreement as pseudo-ground-truth when no manual reference exists. Labels the report `reference=consensus(...)` so absolute quality and inter-engine agreement never get confused.
- `scripts/fetch-fixtures.sh` clones Kreuzberg's Apache-2.0 corpus (pinned via `KREUZBERG_REF`) and symlinks 154 PDFs + 180 ground-truth markdown files into `fixtures/kreuzberg/`. We don't vendor upstream PDFs directly, since per-fixture licenses vary.
- `make benchmark-fetch`, `make benchmark-run ENGINE=<E> OUTPUT=<F>`, `make benchmark-compare BASE=<F> HEAD=<F>`.

Why this matters
The 170-PDF byte-diff regression sweep we'd been using couldn't tell us how good extraction was — only that it didn't change. The harness immediately found a real bug (B1: shared Form XObject per-page CTM regression causing every page to return page 0's content) that byte-diff couldn't because both branches had the same bug. TF1 p10 moved from 0.776 → 0.849 (+7.3pp) once B1 was fixed.
Commit series
Six phased commits so reviewers can see the scoring reveal itself:
- faf51b2
- 5d9c990
- bf1eaef
- 8ec1dcb
- 3794409
- 99c6084
- 0dd0310
- 829d858
- 671cd6e

Test plan
- `cargo test -p benchmark-harness`: 18 tests pass
- `cargo clippy -p benchmark-harness --all-targets -- -D warnings`: clean
- `make benchmark-fetch` → `make benchmark-run` → `make benchmark-compare` validates six separate bug fixes (B1, B3, B4, B7, B8a, B9) with no per-fixture regressions > 0.5pp.

Follow-up bug fixes using this harness
Separate PRs against `release/v0.3.31`:

And the combined branch with all six fixes + this harness: `fix/all-benchmark-bugfixes`.